DCG Induction using MDL and Parsed
Abstract
We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon estimation. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best results are achieved when the learner is supervised with both a compression-based prior and a parsed corpus.
Related papers
MDL-based DCG Induction for NP Identification
We introduce a learner capable of automatically extending large, manually written natural language Definite Clause Grammars with missing syntactic rules. It is based upon the Minimum Description Length principle, and can be trained upon either just raw text, or else raw text additionally annotated with parsed corpora. As a demonstration of the learner, we show how full Noun Phrases (NPs that m...
Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
In this paper, we report experiments on the unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora, using a limited number of search heuristics that do not rely on any previous lexico-syntactic knowledge about SCFs. Although preliminary, reported res...
Logic Program Induction using MDL and MAP: An Application to Grammars
Probabilistic programs provide an appealing language for describing mental theories, because they are Turing complete: any computable process may be described as a program. Program induction is the problem of inferring theories, in the form of (probabilistic) programs, that describe some set of observations. Minimum Description Length, or MDL, is one common approach to program induction [11]. T...
Unsupervised Word Induction Using MDL Criterion
Unsupervised learning of units (phonemes, words, phrases, etc.) is important to the design of statistical speech and NLP systems. This paper presents a general source-coding framework for inducing words from natural language text without word boundaries. An efficient search algorithm is developed to optimize the minimum description length (MDL) induction criterion. Despite some seemingly over-s...